This lecture explores methods for transforming data, focusing on aggregation.
We will mostly follow Chapter 3 of the data visualization book (Heer et al.)
Fundamental problem in data visualization: in most cases, you do not want to show every single data point in your dataset.
Instead, you want to extract patterns which you (the analyst) think are interesting.
Aggregation
One nice thing about altair is that it nudges you to aggregate.
One example: if you try to make a plot with 10,000 dots, it will give you an error: MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000).
Help file: “This is not because Altair cannot handle larger datasets, but it is because it is important for the user to think carefully about how large datasets are handled.”
```python
import pandas as pd
import altair as alt

movies_url = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/movies.json'
```
An aside on JSON
Many of the sample vega-datasets, including movies, are stored as .json
Other places you have or will see JSON
Recall from viz lecture 1 that altair writes Vega-lite, which is also recorded in JSON
In spatial lectures, we will also encounter the geojson format, which can store geographic features
JSON shines at storing hierarchical data in a way that is easily readable to both humans and machines. Family tree example, wikipedia example.
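To make the hierarchy point concrete, here is a small made-up family-tree snippet parsed with Python's built-in json module; nesting like this is awkward to represent in a flat table:

```python
import json

# A made-up family tree: each person nests a list of children.
family_json = '''
{
  "name": "Ada",
  "born": 1920,
  "children": [
    {"name": "Ben", "born": 1950,
     "children": [{"name": "Cleo", "born": 1980, "children": []}]},
    {"name": "Dora", "born": 1953, "children": []}
  ]
}
'''

family = json.loads(family_json)  # JSON -> nested Python dicts and lists
print(family['children'][0]['children'][0]['name'])  # -> Cleo
```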
The movies dataset, however, is just an ordinary table like any other. We will convert it to a pandas DataFrame for Altair to use
```python
movies = pd.read_json(movies_url)
```
head()
```python
movies.head(5)
```

| | Title | US_Gross | Worldwide_Gross | US_DVD_Sales | Production_Budget | Release_Date | MPAA_Rating | Running_Time_min | Distributor | Source | Major_Genre | Creative_Type | Director | Rotten_Tomatoes_Rating | IMDB_Rating | IMDB_Votes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Land Girls | 146083.0 | 146083.0 | NaN | 8000000.0 | Jun 12 1998 | R | NaN | Gramercy | None | None | None | None | NaN | 6.1 | 1071.0 |
| 1 | First Love, Last Rites | 10876.0 | 10876.0 | NaN | 300000.0 | Aug 07 1998 | R | NaN | Strand | None | Drama | None | None | NaN | 6.9 | 207.0 |
| 2 | I Married a Strange Person | 203134.0 | 203134.0 | NaN | 250000.0 | Aug 28 1998 | None | NaN | Lionsgate | None | Comedy | None | None | NaN | 6.8 | 865.0 |
| 3 | Let's Talk About Sex | 373615.0 | 373615.0 | NaN | 300000.0 | Sep 11 1998 | None | NaN | Fine Line | None | Comedy | None | None | 13.0 | NaN | NaN |
| 4 | Slam | 1009819.0 | 1087521.0 | NaN | 1000000.0 | Oct 09 1998 | R | NaN | Trimark | Original Screenplay | Drama | Contemporary Fiction | None | 62.0 | 3.4 | 165.0 |
shape
```python
movies.shape
```

(3201, 16)
With 3,201 movies, we are under Altair's 5,000-observation limit.
However, as you will see shortly, it is hard to learn much from a plot of 3,201 raw observations.
That makes this a good case study for the value of transformations when we want to uncover patterns in the data
Variables of interest
Rotten Tomatoes ratings are determined by taking “thumbs up” and “thumbs down” judgments from film critics and calculating the percentage of positive reviews.
IMDB ratings are formed by averaging scores (ranging from 1 to 10) provided by the site’s users.
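A toy illustration (with made-up reviews) of how differently the two ratings are computed: Rotten Tomatoes is a share of positive reviews, while IMDB is a mean score.

```python
# Rotten Tomatoes: percentage of "thumbs up" judgments from critics.
critic_thumbs = [True, True, False, True, False]   # thumbs up/down
rt_rating = 100 * sum(critic_thumbs) / len(critic_thumbs)

# IMDB: average of user scores on a 1-10 scale.
user_scores = [7, 9, 4, 8, 6]
imdb_rating = sum(user_scores) / len(user_scores)

print(rt_rating)    # 60.0 -> "60% positive"
print(imdb_rating)  # 6.8  -> average score
```

Note that a film can score very differently on the two scales: many lukewarm positive reviews give a high Rotten Tomatoes percentage but only a middling average score.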
Exploring the raw data
```python
alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', title="Rotten Tomatoes Rating (%)"),
    alt.Y('IMDB_Rating:Q', title="IMDB Rating")
)
```
Recall from last lecture: label the axis when the scale is a percentage!
Aggregation
Aggregation: roadmap
In previous lectures, we actually already saw aggregation via average() and min(). We just didn’t talk explicitly about that step. Now, we examine it more carefully.
```python
movies = movies.dropna(subset=['Major_Genre'])
alt.Chart(movies).mark_bar().encode(
    alt.X('average(Rotten_Tomatoes_Rating):Q', title="Average Rotten Tomatoes Rating (%)"),
    alt.Y('Major_Genre:N', title="Genre")
)
```
This plot is fine, but it is hard to pull out takeaways quickly.
average() with sort(...)
More useful: sort the bars vertically, based on x-axis encoding
```python
alt.Chart(movies).mark_bar().encode(
    alt.X('average(Rotten_Tomatoes_Rating):Q', title="Average Rotten Tomatoes Rating (%)"),
    alt.Y('Major_Genre:N', title="Genre",
          sort=alt.EncodingSortField(op='average', field='Rotten_Tomatoes_Rating', order='descending'))
)
```
This focuses the viewer’s attention on which genres are rated highest and lowest
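The same aggregate-then-sort logic can be expressed in pandas. A sketch with a made-up stand-in for the movies DataFrame (the real data would be loaded from movies_url):

```python
import pandas as pd

# Toy stand-in for the movies DataFrame (made-up numbers), to show the
# pandas equivalent of average() plus EncodingSortField.
movies = pd.DataFrame({
    'Major_Genre': ['Drama', 'Comedy', 'Drama', 'Horror', 'Comedy'],
    'Rotten_Tomatoes_Rating': [80, 55, 90, 40, 65],
})

avg_by_genre = (movies
                .groupby('Major_Genre')['Rotten_Tomatoes_Rating']
                .mean()                           # one row per genre
                .sort_values(ascending=False))    # descending, like the bars
print(avg_by_genre)  # Drama 85.0, Comedy 60.0, Horror 40.0
```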
Interquartile range
Plot 1st and 3rd quartiles, then sort by median.
```python
alt.Chart(movies).mark_bar().encode(
    alt.X('q1(Rotten_Tomatoes_Rating):Q', title="Rotten Tomatoes Rating (%)"),
    alt.X2('q3(Rotten_Tomatoes_Rating):Q'),
    alt.Y('Major_Genre:N',
          sort=alt.EncodingSortField(op='median', field='Rotten_Tomatoes_Rating', order='descending'),
          title="Genre")
)
```
Discussion question: what can you learn from the IQR plot that you could not learn from the plot with just average()?
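The quartiles themselves can be inspected in pandas with groupby plus quantile. A sketch with made-up ratings (pandas' default interpolation may differ slightly from Vega-Lite's q1/q3):

```python
import pandas as pd

# Made-up ratings to show the pandas analogue of q1()/median()/q3().
movies = pd.DataFrame({
    'Major_Genre': ['Drama'] * 4 + ['Comedy'] * 4,
    'Rotten_Tomatoes_Rating': [60, 70, 80, 90, 20, 40, 60, 80],
})

iqr = (movies
       .groupby('Major_Genre')['Rotten_Tomatoes_Rating']
       .quantile([0.25, 0.5, 0.75])   # q1, median, q3 per genre
       .unstack())                    # one column per quantile
print(iqr)
```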
```python
alt.Chart(movies).mark_point().encode(
    alt.X('Release_Date:T', title="Release Date"),
    alt.Y('US_Gross:Q', title="US Gross ($)")
)
```
Obviously we need to aggregate.
Also: what bug in the data does this plot reveal?
Do-pair-share
Do – make a plot on your own
Pair – compare your results with the person next to you
Share – discuss results as a class
Question: What time of year are the highest grossing films released? Aggregate both the x- and the y-variables.
There are several ways to approach answering this question. What seems most reasonable to you?
Data aggregation: Do-pair-share
Starter code is in the lecture file dps_highest_grossing_film.qmd:
```python
import pandas as pd
import altair as alt

movies_url = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/movies.json'
movies = pd.read_json(movies_url)

# unaggregated scatter plot
alt.Chart(movies).mark_point().encode(
    alt.X('Release_Date:T', title="Release Date"),
    alt.Y('US_Gross:Q', title="US Gross ($)")
)
```
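One possible starting point (not the only reasonable approach): extract the release month, then aggregate gross within each month. A sketch with a few made-up rows standing in for the movies data:

```python
import pandas as pd

# Made-up rows standing in for the movies data.
movies = pd.DataFrame({
    'Release_Date': pd.to_datetime(
        ['1998-06-12', '1998-07-03', '1999-07-16', '1999-12-10']),
    'US_Gross': [100, 400, 600, 300],
})

# Aggregate the x-variable to month, then the y-variable within month.
movies['month'] = movies['Release_Date'].dt.month
gross_by_month = movies.groupby('month')['US_Gross'].median()
print(gross_by_month)
```

Whether to summarize with the median, the mean, or the maximum is itself one of the judgment calls the question asks you to defend.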
These transformation verbs are fundamental to any data analysis project, so in every package you learn, you will need to know how to do them.
| Purpose | Vega | pandas equivalent |
|---|---|---|
| Define a new variable | transform_calculate() | df['new_col'] |
| Filter to subset of rows | transform_filter(cond) | df.loc[cond] |
| Aggregate function: collapse rows down to one per group | transform_aggregate(..., groupby=[...]) | df.groupby('A').agg('mean') |
| Window function: transform across multiple rows, keeping the same number of rows | transform_window(...) | df['values'].cumsum() |
Connection to pandas operations
You already know how to do all of these in pandas, so this is not conceptually new.
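For reference, a minimal pandas sketch (with made-up data) showing one line per verb from the list above:

```python
import pandas as pd

# Made-up data; one pandas line per transformation verb.
df = pd.DataFrame({'group': ['a', 'a', 'b'], 'values': [1, 2, 3]})

df['doubled'] = df['values'] * 2              # define a new variable
subset = df.loc[df['values'] > 1]             # filter to subset of rows
means = df.groupby('group')['values'].mean()  # aggregate: one row per group
df['running'] = df['values'].cumsum()         # window: same number of rows
```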
Why bother doing it in altair?
Exploratory data analysis can be done faster in altair: manipulate data and plot simultaneously
Aggregation and transformations are temporary – don’t need to define and keep track of new aggregated dataframes
transform_calculate and transform_timeunit
transform_calculate() uses expressions for writing basic formulas